title: “EDA of Trending YouTube Video Statistics”
author: “Karra Anand”
date: “18 March 2018”
output: html_document
YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. This dataset is a daily record of the top trending YouTube videos.
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the USA, with up to 200 listed trending videos per day.
More details about this dataset are in the About_datasat.txt file included with this project.
Before we move on to plotting and analyzing the data, let us see if the data requires any cleaning.
The first 2 rows of each of the columns of the dataset are as follows:
## video_id trending_date
## 1 2kyS6SvSYSE 17.14.11
## 2 1ZAPwfrtAFY 17.14.11
## title
## 1 WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## channel_title category_id publish_time
## 1 CaseyNeistat 22 2017-11-13T17:13:01.000Z
## 2 LastWeekTonight 24 2017-11-13T07:30:00.000Z
## tags
## 1 SHANtell martin
## 2 last week tonight trump presidency|last week tonight donald trump|john oliver trump|donald trump
## views likes dislikes comment_count
## 1 748374 57527 2966 15954
## 2 2418783 97185 6146 12703
## thumbnail_link comments_disabled
## 1 https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg False
## 2 https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg False
## ratings_disabled video_error_or_removed
## 1 False False
## 2 False False
## description
## 1 SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\\nwith this lens -- http://amzn.to/2rUJOmD\\nbig drone - http://tinyurl.com/h4ft3oy\\nOTHER GEAR --- http://amzn.to/2o3GLX5\\nSony CAMERA http://amzn.to/2nOBmnv\\nOLD CAMERA; http://amzn.to/2o2cQBT\\nMAIN LENS; http://amzn.to/2od5gBJ\\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\\nBIG Canon CAMERA; http://tinyurl.com/jn4q4vz\\nBENDY TRIPOD THING; http://tinyurl.com/gw3ylz2\\nYOU NEED THIS FOR THE BENDY TRIPOD; http://tinyurl.com/j8mzzua\\nWIDE LENS; http://tinyurl.com/jkfcm8t\\nMORE EXPENSIVE WIDE LENS; http://tinyurl.com/zrdgtou\\nSMALL CAMERA; http://tinyurl.com/hrrzhor\\nMICROPHONE; http://tinyurl.com/zefm4jy\\nOTHER MICROPHONE; http://tinyurl.com/jxgpj86\\nOLD DRONE (cheaper but still great);http://tinyurl.com/zcfmnmd\\n\\nfollow me; on http://instagram.com/caseyneistat\\non https://www.facebook.com/cneistat\\non https://twitter.com/CaseyNeistat\\n\\namazing intro song by https://soundcloud.com/discoteeth\\n\\nad disclosure. THIS IS NOT AN AD. not selling or promoting anything. but samsung did produce the Shantell Video as a 'GALAXY PROJECT' which is an initiative that enables creators like Shantell and me to make projects we might otherwise not have the opportunity to make. hope that's clear. if not ask in the comments and i'll answer any specifics.
## 2 One year after the presidential election, John Oliver discusses what we've learned so far and enlists our catheter cowboy to teach Donald Trump what he hasn't.\\n\\nConnect with Last Week Tonight online...\\n\\nSubscribe to the Last Week Tonight YouTube channel for more almost news as it almost happens: www.youtube.com/user/LastWeekTonight\\n\\nFind Last Week Tonight on Facebook like your mom would: http://Facebook.com/LastWeekTonight\\n\\nFollow us on Twitter for news about jokes and jokes about news: http://Twitter.com/LastWeekTonight\\n\\nVisit our official site for all that other stuff at once: http://www.hbo.com/lastweektonight
In many of the following data cleaning steps only the code but not the output is printed to prevent repeated printing of the same dataset with minor modifications. The final dataset obtained after the data cleaning is printed at the end of this section.
As we are only interested in exploratory analysis of the data, we remove the columns of tags, thumbnail_link and description since they are irrelevant to us.
## video_id trending_date
## 1 2kyS6SvSYSE 17.14.11
## 2 1ZAPwfrtAFY 17.14.11
## 3 5qpjK5DgCt4 17.14.11
## 4 puqaWrEC7tY 17.14.11
## 5 d380meD0W0M 17.14.11
## 6 gHZ1Qz0KiKM 17.14.11
## title
## 1 WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3 Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4 Nickelback Lyrics: Real or Fake?
## 5 I Dare You: GOING BALD!?
## 6 2 Weeks with iPhone X
## channel_title category_id publish_time views
## 1 CaseyNeistat 22 2017-11-13T17:13:01.000Z 748374
## 2 LastWeekTonight 24 2017-11-13T07:30:00.000Z 2418783
## 3 Rudy Mancuso 23 2017-11-12T19:05:24.000Z 3191434
## 4 Good Mythical Morning 24 2017-11-13T11:00:04.000Z 343168
## 5 nigahiga 24 2017-11-12T18:01:41.000Z 2095731
## 6 iJustine 28 2017-11-13T19:07:23.000Z 119180
## likes dislikes comment_count comments_disabled ratings_disabled
## 1 57527 2966 15954 False False
## 2 97185 6146 12703 False False
## 3 146033 5339 8181 False False
## 4 10172 666 2146 False False
## 5 132235 1989 17518 False False
## 6 9763 511 1434 False False
## video_error_or_removed
## 1 False
## 2 False
## 3 False
## 4 False
## 5 False
## 6 False
## [1] 23362 13
From the dimensions of the dataframe in the above output, we have data for 23,362 videos across 13 features (assuming there are no duplicates; we shall investigate this in the following sections). We use the data.table data structure to store our data, as it is superior to a dataframe for add, remove, update, join and similar operations.
library(data.table)
yt_trending <- data.table(yt_trending)
Further, we have the name of each category_id (as part of the US_category_id.json file), so we add a category_name column.
# Category ids and names taken from US_category_id.json
category_id <- c(1,2,10,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,
                 34,35,36,37,38,39,40,41,42,43,44)
category_name <- c("Film & Animation","Autos & Vehicles","Music",
                   "Pets & Animals","Sports","Short Movies",
                   "Travel & Events","Gaming","Videoblogging",
                   "People & Blogs","Comedy","Entertainment",
                   "News & Politics","Howto & Style","Education",
                   "Science & Technology",
                   "Nonprofits & Activism","Movies","Anime/Animation",
                   "Action/Adventure","Classics","Comedy",
                   "Documentary","Drama","Family","Foreign","Horror",
                   "Sci-Fi/Fantasy","Thriller","Shorts","Shows","Trailers")
category_id_names <- data.frame(category_id,category_name)
# Join the names onto the trending data by the shared category_id column
yt_trending <- merge(yt_trending,category_id_names)
# Sort by views, highest first
yt_trending <- yt_trending[order(yt_trending$views,decreasing = TRUE),]
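One detail of the join above is worth checking: with its default arguments, merge() performs an inner join, so any row whose category_id is missing from the lookup table is dropped. The following is a minimal sketch on made-up data (the ids, names and the unmatched id 99 are illustrative assumptions, not values from the dataset) that contrasts the default with all.x = TRUE:

```r
# Toy stand-ins for the real tables; ids and names are made up.
videos <- data.frame(
  video_id    = c("a", "b", "c"),
  category_id = c(10, 24, 99),   # 99 has no entry in the lookup table
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  category_id   = c(10, 24),
  category_name = c("Music", "Entertainment"),
  stringsAsFactors = FALSE
)

# Default merge: inner join on the shared column category_id
inner <- merge(videos, lookup)

# all.x = TRUE: left join, unmatched videos are kept with an NA name
left <- merge(videos, lookup, all.x = TRUE)

print(nrow(inner))
print(left)
```

Comparing the row count before and after the merge is a quick way to see whether any videos were dropped by the join.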
We are most interested in analyzing the trends in the trending videos, but we do not have a column indicating the number of days a video was on trending, so we add that column here.
# Count the rows per video_id: each row is one day on the trending list
days_on_trending <- yt_trending[, .(days_on_trending = .N), by = video_id]
# Attach the counts to every row of the main table
yt_trending <- merge(yt_trending, days_on_trending, by = "video_id")
# Sort by days on trending, then by views, both in decreasing order
yt_trending <- yt_trending[order(yt_trending$days_on_trending,
                                 yt_trending$views,
                                 decreasing = TRUE),]
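Having sorted the videos in decreasing order of their number of days on trending (and then of views), we keep a single row per video: the first row for each video_id, which is its entry with the highest views.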
yt_trending <- unique(yt_trending, by = c("video_id"))
head(yt_trending)
## video_id category_id trending_date
## 1: sXP6vliZIHI 22 18.04.01
## 2: H0g4JxKp4fc 23 18.12.03
## 3: CwKp6Xhy3_4 10 18.12.03
## 4: kCg5D8KMqk4 26 18.07.03
## 5: E_ViwNxUldw 26 18.07.03
## 6: vQiiNGllGQo 15 18.12.03
## title
## 1: Cardi B - Bartier Cardi (feat. 21 Savage) [Official Audio]
## 2: *cough*
## 3: Chris Young - Hangin' On
## 4: MY EVERYDAY MAKEUP ROUTINE
## 5: Clear crisps / Glass Potato Chips
## 6: Elderly man making sure his dog won't get wet
## channel_title publish_time views likes
## 1: Cardi B 2017-12-22T05:00:02.000Z 17540613 380464
## 2: jacksfilms 2018-02-26T19:00:02.000Z 2292736 224986
## 3: ChrisYoungVEVO 2018-02-26T08:00:02.000Z 1117570 7504
## 4: LaurDIY 2018-02-21T23:00:04.000Z 1006188 46829
## 5: My Virgin Kitchen 2018-02-21T20:07:18.000Z 784713 12069
## 6: Rock me, Joey Santiago. 2018-02-26T11:09:32.000Z 713574 12448
## dislikes comment_count comments_disabled ratings_disabled
## 1: 20697 29122 False False
## 2: 8689 41467 False False
## 3: 584 324 False False
## 4: 710 9653 False False
## 5: 1274 1453 False False
## 6: 146 1474 False False
## video_error_or_removed category_name days_on_trending
## 1: False People & Blogs 14
## 2: False Comedy 14
## 3: False Music 14
## 4: False Howto & Style 14
## 5: False Howto & Style 14
## 6: False Pets & Animals 14
## [1] 4712 15
## [1] "video_id" "category_id"
## [3] "trending_date" "title"
## [5] "channel_title" "publish_time"
## [7] "views" "likes"
## [9] "dislikes" "comment_count"
## [11] "comments_disabled" "ratings_disabled"
## [13] "video_error_or_removed" "category_name"
## [15] "days_on_trending"
## video_id category_id trending_date
## 00nmxR1mxIA: 1 Min. : 1.00 18.12.03: 199
## 00RpZZThSAs: 1 1st Qu.:17.00 18.09.01: 141
## 01AEuxSlIMg: 1 Median :24.00 18.01.02: 84
## 02e9klKUN0Y: 1 Mean :20.44 17.13.12: 70
## 02N508BDngc: 1 3rd Qu.:25.00 17.14.11: 69
## 032BPsxhreM: 1 Max. :43.00 17.22.11: 68
## (Other) :4706 (Other) :4081
## title
## DORITOS BLAZE vs. MTN DEW ICE | Super Bowl Commercial with Peter Dinklage and Morgan Freeman: 2
## Justice League - Movie Review : 2
## Maroon 5 - Wait : 2
## Missouri Star Quilt Company Live Stream : 2
## NBA Bloopers - The Starters : 2
## Selena Gomez, Marshmello - Wolves : 2
## (Other) :4700
## channel_title
## The Tonight Show Starring Jimmy Fallon: 51
## ESPN : 46
## TheEllenShow : 44
## Jimmy Kimmel Live : 42
## Netflix : 42
## The Late Show with Stephen Colbert : 41
## (Other) :4446
## publish_time views likes
## 2017-11-17T05:00:00.000Z: 4 Min. : 559 Min. : 0
## 2017-11-17T05:00:01.000Z: 3 1st Qu.: 95075 1st Qu.: 1600
## 2017-12-13T15:00:01.000Z: 3 Median : 331606 Median : 7726
## 2018-01-12T05:00:01.000Z: 3 Mean : 1277663 Mean : 39715
## 2018-02-16T14:00:03.000Z: 3 3rd Qu.: 1025326 3rd Qu.: 25876
## 2017-11-10T05:00:01.000Z: 2 Max. :149376127 Max. :3093544
## (Other) :4694
## dislikes comment_count comments_disabled ratings_disabled
## Min. : 0 Min. : 0.0 False:4633 False:4686
## 1st Qu.: 79 1st Qu.: 238.0 True : 79 True : 26
## Median : 302 Median : 888.5
## Mean : 2598 Mean : 4975.6
## 3rd Qu.: 1058 3rd Qu.: 2914.2
## Max. :1674420 Max. :1361580.0
##
## video_error_or_removed category_name days_on_trending
## False:4711 Entertainment :1141 Min. : 1.000
## True : 1 Music : 585 1st Qu.: 3.000
## News & Politics: 438 Median : 5.000
## Howto & Style : 436 Mean : 4.958
## Comedy : 390 3rd Qu.: 7.000
## People & Blogs : 368 Max. :14.000
## (Other) :1354
As we can see from the above output, we have the data of 4,712 videos and their 15 features.
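The cleaning steps that lead to this one-row-per-video table (count the days, sort, keep the first row per video) can be sketched on a toy example. This is an illustrative base R version under assumed data, not the code used above; the ids and view counts are made up.

```r
# Toy data: one row per (video, day); ids and view counts are made up.
trending <- data.frame(
  video_id = c("a", "b", "a", "c", "a", "b"),
  views    = c(100, 50, 300, 10, 200, 80),
  stringsAsFactors = FALSE
)

# Rows per video_id = number of days the video appeared on the list
trending$days_on_trending <- ave(trending$views, trending$video_id,
                                 FUN = length)

# Sort by days on trending, then by views, both decreasing
trending <- trending[order(trending$days_on_trending, trending$views,
                           decreasing = TRUE), ]

# Keep only the first (highest-view) row of each video
deduped <- trending[!duplicated(trending$video_id), ]
print(deduped)
```

Because the sort happens before the de-duplication, the row kept for each video is the one with its highest view count.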
## Action/Adventure Anime/Animation Autos & Vehicles
## 0 0 66
## Classics Comedy Documentary
## 0 390 0
## Drama Education Entertainment
## 0 186 1141
## Family Film & Animation Foreign
## 0 237 0
## Gaming Horror Howto & Style
## 57 0 436
## Movies Music News & Politics
## 0 585 438
## Nonprofits & Activism People & Blogs Pets & Animals
## 13 368 116
## Science & Technology Sci-Fi/Fantasy Short Movies
## 308 0 0
## Shorts Shows Sports
## 0 2 320
## Thriller Trailers Travel & Events
## 0 0 49
## Videoblogging
## 0
Plotting the number of videos by category name, we can see a large amount of variation in the number of videos in each category. Entertainment and Music are the top 2 categories with 1,141 and 585 videos respectively, followed by News & Politics with 438 videos and Howto & Style with 436 videos. We can also see their respective shares of the total.
Also, some categories like Action/Adventure, Anime/Animation etc. have no videos.
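The shares mentioned above can be recomputed directly from the category counts. The sketch below copies the non-zero counts from the table above into a named vector (the variable names are illustrative) and expresses each as a percentage of the total:

```r
# Non-zero category counts copied from the table above
counts <- c("Autos & Vehicles" = 66, "Comedy" = 390, "Education" = 186,
            "Entertainment" = 1141, "Film & Animation" = 237, "Gaming" = 57,
            "Howto & Style" = 436, "Music" = 585, "News & Politics" = 438,
            "Nonprofits & Activism" = 13, "People & Blogs" = 368,
            "Pets & Animals" = 116, "Science & Technology" = 308,
            "Shows" = 2, "Sports" = 320, "Travel & Events" = 49)

# Rank by count first, then convert to a percentage share of all videos
ranked <- sort(counts, decreasing = TRUE)
shares <- round(100 * ranked / sum(ranked), 1)

print(sum(counts))      # should equal the number of videos in the dataset
print(head(shares, 4))  # shares of the four largest categories
```

Sorting on the raw counts rather than on the rounded shares avoids ties between categories whose shares round to the same value.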
## [1] "Summary of views feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 559 95070 331600 1278000 1025000 149400000
## [1] "Summary of likes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1600 7726 39720 25880 3094000
## [1] "Summary of dislikes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 79 302 2598 1058 1674000
## [1] "Summary of comments feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 238.0 888.5 4976.0 2914.0 1362000.0
This plot shows the log10 of the number of views, likes, dislikes and comments on the videos.
On this log scale, each distribution looks roughly normal. Further, we can see that the mean number of views is higher than that of any other attribute.
The mean, median and maximum number of views are 1,278,000, 331,600 and 149,400,000 respectively.
The mean, median and maximum number of likes are 39,720, 7,726 and 3,094,000 respectively.
The mean, median and maximum number of dislikes are 2,598, 302 and 1,674,000 respectively.
The mean, median and maximum number of comments are 4,976, 888.5 and 1,362,000 respectively.
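To illustrate why a log10 scale suits these features, the sketch below takes the quantiles of views from the summary output earlier in this section and compares how far the maximum and the minimum sit from the median, on the raw scale and on the log10 scale. The variable names are illustrative, and the log10(x + 1) workaround in the comments is an assumption, not something taken from the code behind the plots.

```r
# Quantiles of views copied from the summary output earlier in this section
views_q <- c(min = 559, q1 = 95075, median = 331606,
             q3 = 1025326, max = 149376127)

log_q <- log10(views_q)
print(round(log_q, 2))

# Distance from median to max, relative to distance from min to median.
# A value near 1 means the two tails are about equally long.
raw_ratio <- (views_q[["max"]] - views_q[["median"]]) /
             (views_q[["median"]] - views_q[["min"]])
log_ratio <- (log_q[["max"]] - log_q[["median"]]) /
             (log_q[["median"]] - log_q[["min"]])
print(c(raw = raw_ratio, log10 = log_ratio))

# Note: likes, dislikes and comments have a minimum of 0, and log10(0)
# is -Inf. A common workaround (an assumption here) is log10(x + 1).
print(log10(0))
```

On the raw scale the upper tail is several hundred times longer than the lower one, while on the log10 scale the two are of similar length.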
## [1] "Summary of videos with comments disabled"
## False True
## 4633 79
## [1] "Summary of videos with ratings disabled"
## False True
## 4686 26
So we have only 79 videos which have their comments disabled and only 26 videos with their ratings disabled.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.958 7.000 14.000
The days on trending show a roughly bimodal distribution, with videos trending for an average of about 5 days and a maximum of 14 days.
Videos of some categories, such as Autos & Vehicles, are absent from the Top 100 Trending Videos (the first 100 rows of the dataset sorted by days on trending and then by views).
## [1] "Summary of views feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 192400 792100 1665000 3411000 2939000 45940000
## [1] "Summary of likes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 512 19830 49440 105600 131500 822000
## [1] "Summary of dislikes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 146.0 779.2 1600.0 5052.0 3766.0 165100.0
## [1] "Summary of comments feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1815 4667 11980 10680 203900
For the Top 100 Trending Videos we cannot see a definitive distribution as we could for these same features in the entire dataset.
The mean, median and maximum number of views are 3,411,000, 1,665,000 and 45,940,000 respectively.
The mean, median and maximum number of likes are 105,600, 49,440 and 822,000 respectively.
The mean, median and maximum number of dislikes are 5,052, 1,600 and 165,100 respectively.
The mean, median and maximum number of comments are 11,980, 4,667 and 203,900 respectively.
Comparing these statistics with those for the entire dataset, the means and medians are all higher in the Top 100, while the maximums are all lower: the most extreme videos by views, likes, dislikes and comments are not among the longest-trending ones.
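The comparison can be made explicit by dividing each Top 100 statistic by its counterpart for the entire dataset. The sketch below uses the rounded values reported in the two sets of summaries; the data frame layout and column names are illustrative choices.

```r
# Rounded summary values copied from the outputs in this section
stats <- data.frame(
  feature = rep(c("views", "likes", "dislikes", "comments"), each = 3),
  stat    = rep(c("mean", "median", "max"), times = 4),
  full    = c(1278000, 331600, 149400000,
              39720, 7726, 3094000,
              2598, 302, 1674000,
              4976, 888.5, 1362000),
  top100  = c(3411000, 1665000, 45940000,
              105600, 49440, 822000,
              5052, 1600, 165100,
              11980, 4667, 203900),
  stringsAsFactors = FALSE
)

# Ratio above 1: the Top 100 value is higher than the full-dataset value
stats$ratio <- round(stats$top100 / stats$full, 2)
print(stats)
```

Reading down the ratio column shows, per feature and statistic, whether the Top 100 subset sits above or below the entire dataset.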
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 12.00 13.00 12.59 13.00 14.00
Videos in the Top 100 Trending Videos trend for between 12 and 14 days, with a mean of about 12.6 days and a median of 13 days.
The dataset, after preprocessing to remove duplicates, contains the data for 4,712 videos and 15 features (video_id, category_id, trending_date, title, channel_title, publish_time, views, likes, dislikes, comment_count, comments_disabled, ratings_disabled, video_error_or_removed, category_name, days_on_trending).
Unordered factors: category_id, category_name.
category_id: 1,2,10,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33, 34,35,36,37,38,39,40,41,42,43,44
category_name: “Film & Animation”,“Autos & Vehicles”,“Music”, “Pets & Animals”,“Sports”,“Short Movies”, “Travel & Events”,“Gaming”,“Videoblogging”, “People & Blogs”,“Comedy”,“Entertainment”, “News & Politics”,“Howto & Style”,“Education”, “Science & Technology”, “Nonprofits & Activism”,“Movies”,“Anime/Animation”, “Action/Adventure”,“Classics”,“Comedy”, “Documentary”,“Drama”,“Family”,“Foreign”,“Horror”, “Sci-Fi/Fantasy”,“Thriller”,“Shorts”,“Shows”,“Trailers”
Other observations:
The number of views (views), number of days the video is trending (days_on_trending) and category of the video (category_name) are the main features of interest here.
Although subject to the viewers’ biases, number of likes, dislikes and comments can also help in understanding the video’s position on the list of trending videos.
The variable category_name was created to make the category_id feature easier to interpret; each category_id corresponds directly to a category_name.
Also, the days_on_trending variable was created to keep track of the number of days the video was on trending.
One unusual observation was that there were no videos from some of the categories, such as Action/Adventure and Anime/Animation.
Some features (tags, thumbnail_link and description) were removed from the dataset as they took a lot of space when printed and were irrelevant to our analysis.
Log10 transform was applied in multiple plots to convey the scale and variation in the data as required.
Also, a video that was trending on multiple days was reduced to a single entry, the one with its highest views.
The above correlation matrix helps in identifying some of the interesting trends in the data.
We have a high correlation between
But before we draw scatter plots to visualize these correlations, we have to normalize the ranges of the four features mentioned above.
After normalizing each feature to the range [0,1], we get the following output:
## views likes dislikes comment_count
## 1: 0.117422509 0.122986452 1.236070e-02 0.0213883870
## 2: 0.015345060 0.072727590 5.189260e-03 0.0304550596
## 3: 0.007477869 0.002425697 3.487775e-04 0.0002379588
## 4: 0.006732219 0.015137654 4.240274e-04 0.0070895577
## 5: 0.005249547 0.003901351 7.608605e-04 0.0010671426
## 6: 0.004773304 0.004023864 8.719437e-05 0.0010825658
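The values above are consistent with min-max scaling, (x - min) / (max - min), applied to each feature separately. As a check, the sketch below applies that formula to four view counts taken from earlier outputs: the minimum and maximum from the summary, and the first two rows of the sorted dataset. The helper name min_max is illustrative and not necessarily what the original code used.

```r
# Min-max scaling to [0, 1]; assumes x has at least two distinct values,
# otherwise the denominator is zero.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# views: dataset minimum, rows 1 and 2 of the sorted data, dataset maximum
views  <- c(559, 17540613, 2292736, 149376127)
scaled <- min_max(views)
print(scaled)
```

The second and third results can be compared against the views column in rows 1 and 2 of the normalized output.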
Using suitable limits for the X and Y axis:
We can clearly observe the correlation that we found out previously using the correlation matrix.
We can see the variation in the features of likes, dislikes, comment count and days on trending in the following plots.
The notable but weaker correlations, such as those between views and dislikes and between views and comment count, are also visible here.
As the number of views increases, the other features also increase which is in agreement with our calculated statistics in the univariate plots section.
Moreover, only a small number of videos trend for 10 days or more.
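One point worth keeping in mind when reading these plots is that the Pearson correlation does not change under min-max scaling, because the scaling is a positive linear transformation of each feature. Normalizing therefore only affects the axes of the scatter plots, not the correlation values. A minimal sketch on made-up numbers (the toy data and the min_max helper are illustrative assumptions):

```r
# Toy data; the numbers are made up for illustration.
toy <- data.frame(
  views    = c(100, 400, 250, 900, 50),
  likes    = c(10, 35, 30, 80, 2),
  dislikes = c(1, 9, 2, 12, 3)
)

min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Correlation matrix on the raw values and on the scaled values
cor_raw    <- cor(toy)
cor_scaled <- cor(as.data.frame(lapply(toy, min_max)))

print(round(cor_raw, 2))
print(all.equal(cor_raw, cor_scaled))
```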
The categories of Entertainment and Music have very high values for all the features across the board.
Nonprofits & Activism has the lowest median and 1st quartile values for nearly all the features.
Shows has a very small difference between the 3rd and 1st quartiles for all the features, which is unsurprising since the category contains only 2 videos. It also has the highest median, 1st and 3rd quartile values for days on trending.
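The per-category quartiles that box plots display can also be computed directly. The sketch below does so in base R on a made-up table (the category labels are reused for readability, but the numbers are illustrative and not from the dataset), and shows how a category with very few videos ends up with a narrow interquartile range:

```r
# Toy data; days_on_trending values are made up for illustration.
toy <- data.frame(
  category_name    = c("Music", "Music", "Music", "Music", "Shows", "Shows"),
  days_on_trending = c(2, 5, 7, 14, 9, 10),
  stringsAsFactors = FALSE
)

# 1st quartile, median and 3rd quartile of days_on_trending per category
q <- sapply(split(toy$days_on_trending, toy$category_name),
            quantile, probs = c(0.25, 0.5, 0.75))
print(q)

# Interquartile range per category (3rd quartile minus 1st quartile)
iqr <- q["75%", ] - q["25%", ]
print(iqr)
```

With only two observations, the quartiles of a category are interpolated between those two values, so the box in a box plot is necessarily narrow.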
In the univariate plots and analysis sections, we observed that the trends in the entire dataset were more pronounced in the Top 100 Trending Videos.
We check if this is true for the correlations that we observed at the beginning of the bivariate plots section.
The changes in the correlations from those of the entire dataset are:
Again before plotting, we first normalize the values of the features to be in the range [0,1].
## views likes dislikes comment_count
## 1: 0.37922890 0.462533127 0.124579451 0.142836123
## 2: 0.04591265 0.273262573 0.051787371 0.203385258
## 3: 0.02022370 0.008511685 0.002655141 0.001589139
## 4: 0.01778891 0.056383824 0.003418948 0.047345549
## 5: 0.01294750 0.014068870 0.006837897 0.007126601
## 6: 0.01139241 0.014530244 0.000000000 0.007229601
We can observe the correlation that we found from the correlation matrix above.
The data points lie further from the y = x line, so overall the other features appear less correlated with views.
Also, videos in the Top 100 Trending Videos trend for either 12, 13 or 14 days only.
The plots from Top 100 Trending Videos are on the left and those from the entire dataset are on the right.
The subsetting of data for the Top 100 Trending Videos is clearly visible with the absence of a lot of data points.
The trend of Entertainment and Music having high values across all the features continues, but it is much less pronounced, with other categories like Film & Animation and People & Blogs coming close.
Pets & Animals is, surprisingly, the category with the highest median, 1st and 3rd quartile values in the Top 100 Trending Videos.
We have a high correlation between:
in the complete dataset but not in the Top 100 Trending Videos.
Nonprofits & Activism has the lowest median and 1st quartile values for nearly all the features.
Shows has a very small difference between the 3rd and 1st quartiles for all the features. It also has the highest median, 1st and 3rd quartile values for days on trending.
In the Top 100 Trending Videos, the trend of Entertainment and Music having high values across all the features continues, but it is much less pronounced, with other categories like Film & Animation and People & Blogs coming close.
In the Top 100 Trending Videos subset of the dataset, the Pets & Animals category has the highest median, 1st and 3rd quartile values.
The strongest correlation was between views and likes with 0.83 in the complete dataset and 0.85 in the Top 100 Trending Videos.